Prerequisites: You should have a good understanding of how linear regression works. It's a good idea to review the slides for this module before you start this exercise. Also take a look at the notebook corresponding to exercise 2a.
We'll now learn to do linear regression. In Python, we use the library scikit-learn, which implements many machine learning algorithms. This includes linear regression as well as more advanced techniques like SVMs and neural networks.
In this exercise, we will first make a simple model using just one predictor and then you will build and refine linear models using the techniques discussed in the slides. If you're stuck, several of the included links will help!
In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
As you may have noticed in the imports at the start of the file, this time we included a new library: sklearn. sklearn is short for scikit-learn, a machine learning library for Python.
The structure of scikit-learn:
Some of the following text is taken from the scikit-learn API paper: http://arxiv.org/pdf/1309.0238v1.pdf
All objects within scikit-learn share a uniform common basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions and a transformer interface for converting data.
The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.
An example along these lines:
linear_model = LinearRegression()
linear_model.fit(LSTAT, MEDV)
If one changes methods, say, to logistic regression, one would simply replace LinearRegression() in the snippet above with LogisticRegression().
The predictor interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions for X_test, based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels.
clf.predict(X_test)
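To make the fit/predict cycle concrete, here is a minimal, self-contained sketch using LinearRegression on made-up data (the numbers below are invented purely for illustration):
In [ ]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: a noisy version of the line y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.1, 2.9, 5.2, 6.8])

# Estimator interface: instantiate the object, then learn its parameters with fit
model = LinearRegression()
model.fit(X_train, y_train)

# Predictor interface: produce predictions for new, unseen inputs
X_test = np.array([[4.0], [5.0]])
print(model.predict(X_test))  # close to the true line's values of 9 and 11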
Go back to notebook Ex2a: Supervised Learning and scroll down to where we built the models. Remember how we told you to ignore the code? It's time to understand it now. Understand how we built a simple linear model and how we used it to make predictions. Now it's time to apply that knowledge!
Prompt: Build a predictor for the median house value in towns around Boston! Load up the data given in housing.names and housing.txt and build a linear model that predicts MEDV using all the other features of the data!
Help: You can see the code for the linear model here and an example of linear regression here.
In [ ]:
## Read the housing data! This time it's not comma-separated but space-separated. Read up on how you can use
## Pandas to read space-separated files into a data frame
housing = """Read a .txt file, pay attention to how the data is separated"""
In [41]:
# See if the import worked: print the first 5 rows using some built-in function. Where's your head at?
housing.head()
In [42]:
# Check whether some of the variables are correlated using the visualization techniques built up in module 1!
# LSTAT and MEDV should be related: the higher the proportion of lower-status residents, the lower the price.
# Let's confirm our intuition
# Scatterplot between LSTAT and MEDV
sns.jointplot(x="LSTAT", y="MEDV", data=housing, kind="reg")
Out[42]:
It seems that the predictor LSTAT is correlated with our response and will make a good base model. Let's try to build a simple linear model using just one predictor and the response (sklearn works the same way for more predictors; you just have to put them in one dataframe).
In [74]:
# Define predictor and response
X = housing[['LSTAT']]  # double brackets keep X a DataFrame, since sklearn expects a 2-D input
Y = housing.MEDV
In [77]:
# Load up the linear model and fit it.
from sklearn.linear_model import LinearRegression
lin_mod = LinearRegression()
lin_mod.fit(X, Y)
y_p = lin_mod.predict(X)
# Plot the data (red points) and the fitted line (yellow).
plt.scatter(X, Y, c='r')
plt.plot(X, y_p, c='y')
Out[77]:
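Once the model is fit, the estimator exposes its learned parameters, and score gives the R² of the fit. A quick sanity check on the X and Y defined above:
In [ ]:
print("slope:    ", lin_mod.coef_[0])
print("intercept:", lin_mod.intercept_)
print("R^2 score:", lin_mod.score(X, Y))  # coefficient of determination on the training data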
In [76]:
## Now start making your own regression!
# Remember the potential pitfalls we discussed.
# Correlation - check the correlation of each variable with the others. Here's a correlation heatmap to get you
# started on which predictors should be used and which ones are highly correlated and may pose a problem!
corr = housing.corr()
sns.heatmap(corr)
plt.savefig("correl.png")
In [ ]:
# Start building your model here!
# First you'll need to separate out the predictors and response
X = """ predictors 1 through 13 """
Y = """response = MEDV"""
# You can reuse the lin_mod object: calling fit again re-fits it on the new data!
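If you're not sure where to begin, one possible first step (a sketch, assuming the MEDV column name from above) is to use every column except the response as a predictor:
In [ ]:
X = housing.drop("MEDV", axis=1)  # all 13 remaining columns as predictors
Y = housing.MEDV
lin_mod.fit(X, Y)
print(lin_mod.score(X, Y))  # R^2 on the training data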